Data Project

Explore your interests

The idea of this project is for you to find data and do something interesting with it that relates to the skills you have developed in this course.

The project is intentionally open-ended. It is an opportunity for you to explore your interests by using techniques or technologies that branch off from those introduced in this class, by working with datasets related to problems you face in your career or are curious about, or both.

A great project will demonstrate a complete “Data Science Workflow”, such as the one introduced at the beginning of the textbook. It will start from multiple distinct datasets from public sources (if you have a private dataset that you want to use, you may discuss that with me) and proceed through data tidying, exploratory data analysis, cleaning, hypothesis generation, and lastly modeling. This course has not focused on modeling, and the mathematical sophistication of your model will not factor heavily in your evaluation, but by this point in the semester you will have been exposed to enough modeling techniques in other courses to incorporate models within your project.

The output of your project should be software that allows me to reproduce your work and a research report that describes what you did with the data and what you found. Both could be contained within a GitHub repository. You will also give a short presentation of your project in the final week of the semester.

Project Proposal

The first step of your project is to submit an initial proposal, which will help to ensure that your project is both worthwhile and feasible in the time frame of the last half of the semester. Your proposal should describe the area that you are interested in (and if you have a specific question already you can describe that too), the datasets you are going to use, and your data analysis plan. Your proposal should be a maximum of 2 pages (excluding figures and references), and should include three sections (each with target length 1-2 paragraphs):

  • Section 1 - Introduction: The introduction should describe the motivation for your project, the subject matter area of your dataset, and any general research questions you may already have.

  • Section 2 - Data: At the time of submitting the proposal you should have already identified some datasets that you plan to use in your analysis. Give a quick description of those datasets here, describe where the data can be found, the number of variables and observations, and/or the output of the glimpse() or skim() functions on the data.

  • Section 3 - Data analysis plan: Describe your initial approach to analyzing your data. How will your datasets need to be processed in order for you to make use of all of them in the analysis? What sorts of comparisons do you plan to make in your exploratory data analysis (you can share early visualizations of this if you have them)? What statistical methods do you think are appropriate to address the questions that motivated your interest?
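
As a rough illustration of the dataset description requested in Section 2, the sketch below shows one way to report the number of observations and variables and to produce glimpse() and skim() output. It assumes the dplyr and skimr packages are installed; the built-in mtcars data and the name project_data are placeholders standing in for whatever dataset you actually use.

```r
# Assumes the dplyr and skimr packages are installed.
library(dplyr)
library(skimr)

# Placeholder: `mtcars` ships with base R and stands in for a project dataset.
# `project_data` is a hypothetical name used only for this illustration.
project_data <- as_tibble(mtcars, rownames = "model")

# Number of observations (rows) and variables (columns)
dim(project_data)

# Compact column-by-column view: name, type, and the first few values
glimpse(project_data)

# Per-column summary statistics, including counts of missing values
skim(project_data)
```

Either output can be pasted into the Data section of the proposal, alongside a sentence on where the data can be found.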

The proposal is preliminary, and you can include additional data or a different analysis as needs require.

Heilmeier's Questions

When writing proposals, I’ve found the following set of considerations, called “Heilmeier's Questions”, to be helpful. Although this is just a short project, you might find them helpful as well, both here and when formulating future projects and proposals. These questions were developed by Dr. George Heilmeier, who was the director of DARPA from 1975 to 1977. Heilmeier said that every proposal to DARPA needed to answer these questions clearly and completely in order to receive funding:

  1. What are you trying to do? Articulate your objectives using absolutely no jargon. What is the problem? Why is it hard?
  2. How is it done today, and what are the limits of current practice?
  3. What is new in your approach, and why do you think it will be successful?
  4. Who cares?
  5. If you’re successful, what difference will it make? What applications are enabled as a result?
  6. What are the risks?
  7. How much will it cost?
  8. How long will it take?
  9. What are the midterm and final “exams” to check for success? How will progress be measured?